Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2)~Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo~\cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
translated by 谷歌翻译
Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients' tasks. In PFL, a common strategy is to develop local and global models jointly - the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models' performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.
translated by 谷歌翻译
We introduce BotSIM, a modular, open-source Bot SIMulation environment with dialog generation, user simulation and conversation analytics capabilities. BotSIM aims to serve as a one-stop solution for large-scale data-efficient end-to-end evaluation, diagnosis and remediation of commercial task-oriented dialog (TOD) systems to significantly accelerate commercial bot development and evaluation, reduce cost and time-to-market. BotSIM adopts a layered design comprising the infrastructure layer, the adaptor layer and the application layer. The infrastructure layer hosts key models and components to support BotSIM's major functionalities via a streamlined "generation-simulation-remediation" pipeline. The adaptor layer is used to extend BotSIM to accommodate new bot platforms. The application layer provides a suite of command line tools and a Web App to significantly lower the entry barrier for BotSIM users such as bot admins or practitioners. In this report, we focus on the technical designs of various system components. A detailed case study using Einstein BotBuilder is also presented to show how to apply BotSIM pipeline for bot evaluation and remediation. The detailed system descriptions can be found in our system demo paper. The toolkit is available at: https://github.com/salesforce/BotSIM .
translated by 谷歌翻译
我们介绍了Lavis,这是一个开源深度学习库,用于语言视觉研究和应用。拉维斯(Lavis)的目标是作为一个一站式综合图书馆,它为研究人员和从业人员提供了可访问语言视觉领域的最新进步,并赋予未来的研究和发展。它具有统一的界面,可轻松访问最新的图像语言,视频语言模型和常见数据集。 Lavis支持对各种任务的培训,评估和基准测试,包括多模式分类,检索,字幕,视觉问题答案,对话和预训练。同时,该库还高度可扩展且可配置,从而促进了未来的开发和定制。在此技术报告中,我们描述了图书馆的设计原理,关键组成部分和功能,并在常见的语言视觉任务中提出基准测试结果。该库可在以下网址获得:https://github.com/salesforce/lavis。
translated by 谷歌翻译
在过去的十年中,以数据为驱动的特别基于深度学习的方法已成为机器人Grasp计划的主要范式。但是,这些方法的性能在很大程度上受到可用培训数据集质量的影响。在本文中,我们提出了一个框架来生成对象形状以增强握把数据集,从而可以提高预设计的深神经网络的掌握能力。首先,使用编码器解码器结构网络将对象形状嵌入到低维特征空间中。然后,使用异常检测和掌握质量标准计算每个对象形状的稀有性和掌握得分。最后,在特征空间中生成了新的对象,以利用原始的高稀有性和掌握分数对象的特征。实验结果表明,通过生成的对象形状可以有效提高基于学习的GRASP计划网络的掌握能力。
translated by 谷歌翻译
视频和语言预培训表明对各种下游任务有望改善。最先前的方法捕获与基于变换器的多模式编码器的跨模型交互,不完全解决单向视频和文本特征之间的错位。此外,学习细粒度的视觉语言对准通常需要离上的对象检测器来提供对象信息,这是由检测器有限的词汇和昂贵的计算成本的瓶颈。我们建议对齐和提示:一种高效有效的视频和语言预训练框架,具有更好的跨模型对齐。首先,我们介绍了一个视频文本对比(VTC)丢失,以对准实例级别的单峰视频文本功能,从而缓解跨模型交互的建模。然后,我们提出了一种新的视觉接地预训练任务,提示实体建模(PEM),旨在学习细粒度的区域实体对齐。为实现这一目标,我们首先介绍一个实体发射模块,该模块用VTC培训,以产生与实体名称实例化的视频裁剪和文本提示之间的相似性。 PEM任务然后询问模型以预测随机选择的视频作物的实体伪标签(I.E〜归一化相似度分数)。由此产生的预先训练的模型在文本 - 视频检索和VideoQ上实现了最先进的性能,通过大幅度的边距表现优于现有的工作。我们的代码和预先训练的型号将被释放。
translated by 谷歌翻译
尽管对象检测方面取得了很大进展,但由于实例级边界盒注释所需的巨大人性化,大多数现有方法都仅限于一小一少量的对象类别。为了减轻问题,最近的开放词汇和零射击检测方法试图检测培训期间未见的对象类别。但是,这些方法仍然依赖于一组基类上手动提供的边界盒注释。我们提出了一个开放的词汇检测框架,可以在没有手动提供边界盒注释的情况下培训。我们的方法通过利用预先训练的视觉语言模型的本地化能力来实现这一目标,并产生可直接用于训练对象探测器的伪边界盒标签。 Coco,Pascal VOC,Objects365和LVIS的实验结果证明了我们方法的有效性。具体而言,我们的方法优于使用人类注释的边界箱训练的最先进(SOTA),即使我们的培训源未配备手动边界盒标签,也可以在COCO新型类别上用3%AP培训。在利用手动边界箱标签作为基线时,我们的方法主要超过8%的AP。
translated by 谷歌翻译
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream visionlanguage tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR 2 , ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-ofthe-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
translated by 谷歌翻译
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that bridges contrastive learning with clustering. PCL not only learns low-level features for the task of instance discrimination, but more importantly, it encodes semantic structures discovered by clustering into the learned embedding space. Specifically, we introduce prototypes as latent variables to help find the maximum-likelihood estimation of the network parameters in an Expectation-Maximization framework. We iteratively perform E-step as finding the distribution of prototypes via clustering and M-step as optimizing the network via contrastive learning. We propose ProtoNCE loss, a generalized version of the InfoNCE loss for contrastive learning, which encourages representations to be closer to their assigned prototypes. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks with substantial improvement in low-resource transfer learning. Code and pretrained models are available at https://github.com/salesforce/PCL.
translated by 谷歌翻译
Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at https://github.com/LiJunnan1992/DivideMix.
translated by 谷歌翻译